Midterm project for DATA624 course in CUNY’s MSDS program

Part A - ATM Forecast

#1 Data preparation

First I load in the provided data and take a look at the first few rows. I see that there is an unnecessary hour/minute/second component to our date, so I trim that off.

atm_raw <- read_csv("atm_raw.csv")
head(atm_raw)
## # A tibble: 6 x 3
##   DATE                 ATM    Cash
##   <chr>                <chr> <dbl>
## 1 5/1/2009 12:00:00 AM ATM1     96
## 2 5/1/2009 12:00:00 AM ATM2    107
## 3 5/2/2009 12:00:00 AM ATM1     82
## 4 5/2/2009 12:00:00 AM ATM2     89
## 5 5/3/2009 12:00:00 AM ATM1     85
## 6 5/3/2009 12:00:00 AM ATM2     90
#start cleaned dataset
atm <- atm_raw

#parse the date, dropping the unnecessary hour/minute/second component
atm$DATE <- as.Date(atm$DATE, format = '%m/%d/%Y')

At first glance I see there is some work to do. There are 14 cases with no value for ATM; these can be removed, as they are associated with May 2010 dates (which we’ll be predicting later) and have no Cash value. I also see that ATM1 and ATM2 have a couple of missing values, and ATM3 appears to have a strange distribution, with nearly all of the percentiles equal to zero.

atm %>%
  group_by(ATM) %>%
  skim()
Data summary
Name Piped data
Number of rows 1474
Number of columns 3
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables ATM

Variable type: Date

skim_variable ATM n_missing complete_rate min max median n_unique
DATE ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE NA 0 1 2010-05-01 2010-05-14 2010-05-07 14

Variable type: numeric

skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Cash ATM1 3 0.99 83.89 36.66 1 73.0 91 108 180 ▂▁▇▃▁
Cash ATM2 2 0.99 62.58 38.90 0 25.5 67 93 147 ▇▅▇▇▂
Cash ATM3 0 1.00 0.72 7.94 0 0.0 0 0 96 ▇▁▁▁▁
Cash ATM4 0 1.00 474.01 650.95 2 124.0 404 705 10920 ▇▁▁▁▁
Cash NA 14 0.00 NaN NA NA NA NA NA NA

I drop the cases where the ATM variable is NA, as these also had empty Cash values.

#remove cases with ATM=NA
atm <- atm %>%
  drop_na(ATM)

Let’s take a closer look at ATM3’s Cash values. There are 362 days with zero Cash withdrawn, and one day each with values of 82, 85, and 96 (i.e., $8,200, $8,500, and $9,600, since Cash is recorded in hundreds of dollars).

atm %>%
  filter(ATM == "ATM3") %>%
  group_by(Cash) %>%
  summarize(n = n())
## # A tibble: 4 x 2
##    Cash     n
##   <dbl> <int>
## 1     0   362
## 2    82     1
## 3    85     1
## 4    96     1

Looking at the only cases where ATM3 has a Cash value greater than zero, I see that these are the last 3 days of the time series. This suggests the machine was previously broken or was only installed on 4/28/2010, explaining the lack of data for the majority of the series. As a forecast based on only 3 data points would not be wise, I’ll disregard ATM3 moving forward and relay to my boss that we’ll need more data before we can forecast for it.

atm %>%
  filter(ATM == "ATM3",
         Cash > 0)
## # A tibble: 3 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

Finally, I’ll split each ATM into its own dataframe.

#ensure Cash is stored as a plain numeric vector
atm$Cash <- as.numeric(atm$Cash)


atm1 <- atm %>%
  filter(ATM == "ATM1")

atm2 <- atm %>%
  filter(ATM == "ATM2")

atm4 <- atm %>%
  filter(ATM == "ATM4")

Imputation

Next I’ll see how the imputeTS package would impute the handful of NA values using the linear (default) imputation method.

ATM1

The red values it’s suggesting appear reasonable to my eye, and considering they make up a very small proportion of the full dataset (0.8%), this is acceptable. A seasonally decomposed imputation with na_seadec() could also be considered, as it accounts for seasonality before interpolating.

#store imputation
imputation <- na_interpolation(atm1$Cash)

#plot imputation
ggplot_na_imputations(atm1$Cash, imputation) +
  my_plot_theme

#incorporate imputation
atm1$Cash <- imputation
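For comparison, the seasonally decomposed alternative mentioned above could be checked before incorporating the linear imputation. A sketch, assuming a weekly period of 7 days and run on the pre-imputation Cash vector:

```r
#na_seadec() removes an estimated seasonal component, interpolates the
#gaps, then adds the seasonal component back; wrapping the vector in
#ts(..., frequency = 7) supplies the assumed weekly period
seadec_imputation <- na_seadec(ts(atm1$Cash, frequency = 7))

#compare this candidate against the linear imputation plotted above
ggplot_na_imputations(atm1$Cash, as.numeric(seadec_imputation))
```

If an NA happened to fall on a consistently high- or low-traffic weekday, this version would respect that pattern where plain interpolation would not.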

ATM2

And again for ATM2 we see reasonable values for the default imputation method.

#store imputation
imputation <- na_interpolation(atm2$Cash)

#plot imputation
ggplot_na_imputations(atm2$Cash, imputation) +
  my_plot_theme

#incorporate imputation
atm2$Cash <- imputation

#2 Visualize the data

I plot each ATM as a boxplot below to get a better look at the distributions. ATM1 and ATM2 seem the most normally distributed, though there are quite a few outliers on the low end of ATM1. Further, ATM1 has the highest median of the three ATMs. Finally, ATM4 appears to have a very small range of values; however, taking note of the scale, this is an optics issue caused by one extreme outlier of well over $1,000,000. I’ve stumbled on another data cleaning issue.

atm1_box <- ggplot(atm1, aes(x = Cash)) +
                     geom_boxplot()

atm2_box <- ggplot(atm2, aes(x = Cash)) +
                     geom_boxplot()

atm4_box <- ggplot(atm4, aes(x = Cash)) +
                     geom_boxplot()
  
  
ggarrange(atm1_box, atm2_box, atm4_box,
  labels = c("ATM1", "ATM2", "ATM4"),
  ncol = 1) +
  my_plot_theme

Check for Outliers

A nice function from the forecast package lets us see which cases are outliers. tsoutliers() (documented in the forecast package) is better suited to detecting outliers in a time series than standard methods, as it decomposes the series and considers seasonality and trend before flagging outliers. I’ll now convert each ATM dataframe into a tsibble object so I can use time series functionality moving forward.

ATM1 has over 30 outliers, which is consistent with the boxplot above. We also see the recommended replacement values. While these are statistically identified as outliers, there are so many that it would be unwise to remove and/or replace them. It’s good to know they exist, but I won’t fiddle with them.

#make tsibble version that works with tsoutliers
atm1 <- as.data.frame(atm1)
atm1_ts <- atm1 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm1_ts$Cash)
## $index
##  [1]  45  46  47  48  49  50  51  52  53  54  55  56  62  63  64  65  66  67  68
## [20]  69  70  71  72  73  74 352 353 354 355 356 357 358 359 360 361 362 363 364
## [39] 365
## 
## $replacements
##  [1]  20.46154  21.92308  23.38462  24.84615  26.30769  27.76923  29.23077
##  [8]  30.69231  32.15385  33.61538  35.07692  36.53846  51.28571  52.57143
## [15]  53.85714  55.14286  56.42857  57.71429  59.00000  60.28571  61.57143
## [22]  62.85714  64.14286  65.42857  66.71429 138.00000 138.00000 138.00000
## [29] 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000
## [36] 138.00000 138.00000 138.00000 138.00000
#make tsibble version that will work with autoplot later
atm1_ts <- as.data.frame(atm1_ts)

atm1_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm1$Cash)

ATM2 has far fewer outliers, only 5. These didn’t show up on the boxplot above, but that is likely because the tsoutliers() function finds outliers with regard to seasonality and/or trend, whereas the boxplot is just looking at the raw distribution. While these values are certainly high, I don’t believe they are in error, so I won’t remove or replace them either.

#make tsibble version that works with tsoutliers
atm2 <- as.data.frame(atm2)
atm2_ts <- atm2 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm2_ts$Cash)
## $index
## [1] 361 362 363 364 365
## 
## $replacements
## [1] 136 136 136 136 136
#make tsibble version that will work with autoplot later
atm2_ts <- as.data.frame(atm2_ts)

atm2_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm2$Cash)

ATM4 has the highest number of outliers we’ve seen, and it’s unclear from this readout which is the extreme value seen in the boxplot.

#make tsibble version that works with tsoutliers
atm4 <- as.data.frame(atm4)
atm4_ts <- atm4 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm4_ts$Cash)
## $index
##  [1] 300 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
## [20] 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349
## [39] 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365
## 
## $replacements
##  [1] 804 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [20] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [39] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
#make tsibble version that will work with autoplot later
atm4_ts <- as.data.frame(atm4_ts)

atm4_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm4$Cash)

I filter for values where Cash is greater than 10,000 and find one on 2010-02-09. As this value is so extreme, I’ll replace it with the median.

#find outlier
atm4_ts %>%
  filter(Cash > 10000)
## # A tsibble: 1 x 2 [1D]
##   Date        Cash
##   <date>     <dbl>
## 1 2010-02-09 10920
#set outlier to NA for now
atm4_ts$Cash[atm4_ts$Cash == 10920] <- NA

#calculate median w/o extreme outlier
atm4_median <- median(atm4_ts$Cash, na.rm = TRUE)

#put median in for outlier/NA value
atm4_ts$Cash[is.na(atm4_ts$Cash)] <- atm4_median

I revisit the boxplot and see a few outliers on the high end, but nothing as extreme as I saw earlier.

ggplot(atm4_ts, aes(x = Cash)) +
  geom_boxplot() +
  my_plot_theme

Consider Transformations

Now I can proceed looking for trend, seasonality, and/or the need for Box-Cox transformation on each of the ATMs’ time series data.

ATM1

A plot of the cleaned ATM1 data shows some changes in variance and certainly seasonality.

autoplot(atm1_ts, Cash)

To stabilize the variation seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.35. The plot still has plenty of seasonality, but the variance is more stable.

#calculate lambda
lambda1 <- atm1_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm1_ts %>%
    autoplot(box_cox(Cash, lambda1)) +
     labs(y = "",
       title = "Transformed ATM1 Cash with lambda = 0.35"))
#apply transformation
atm1_tf <-
    atm1_ts %>% 
    mutate(Cash = box_cox(Cash, lambda1))

Now I’ll check the ACF and PACF of the transformed series. I clearly see significant spikes at lags 7, 14, and 21 on the ACF plot, which makes sense as withdrawals likely follow a day-of-week pattern. I’ll definitely need a model that incorporates this seasonality.

#check the transformed series
ggtsdisplay(atm1_tf$Cash, main="ATM1 Cash after Box-Cox")

ATM2

A plot of the cleaned ATM2 data shows some changes in variance and certainly seasonality, similar to ATM1.

autoplot(atm2_ts, Cash)

To stabilize the variation seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.68. The plot still has plenty of seasonality, but the variance is more stable.

#calculate lambda
lambda2 <- atm2_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm2_ts %>%
    autoplot(box_cox(Cash, lambda2)) +
     labs(y = "",
       title = "Transformed ATM2 Cash with lambda = 0.68"))
#apply transformation
atm2_tf <-
    atm2_ts %>% 
    mutate(Cash = box_cox(Cash, lambda2))

Now I’ll check the ACF and PACF of the transformed series. Again I see spikes at multiples of 7, but also more prominent ones at lags 2, 5, and 9 that will have to be checked on.

#check the transformed series
ggtsdisplay(atm2_tf$Cash, main="ATM2 Cash after Box-Cox")

ATM4

A plot of the cleaned ATM4 data shows seasonality with some of the most variance we’ve seen so far.

autoplot(atm4_ts, Cash)

To stabilize the variation seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.40. The plot still has plenty of seasonality, but the variance is more stable.

#calculate lambda
lambda4 <- atm4_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm4_ts %>%
    autoplot(box_cox(Cash, lambda4)) +
     labs(y = "",
       title = "Transformed ATM4 Cash with lambda = 0.40"))
#apply transformation
atm4_tf <-
    atm4_ts %>% 
    mutate(Cash = box_cox(Cash, lambda4))

Now I’ll check the ACF and PACF of the transformed series. I see spikes primarily at multiples of 7, not at the other intervals seen in ATM2.

#check the transformed series
ggtsdisplay(atm4_tf$Cash, main="ATM4 Cash after Box-Cox")

#3 Choose your model

It’s clear that all of our ATMs will need models that incorporate their weekly seasonality. I’ll try the simplest option, the seasonal naive method, first.

ATM1

atm1_tf %>%
  model(
    `SNAIVE` = SNAIVE(Cash)) %>%
  forecast(h = 15) %>%
  autoplot(atm1_tf) +
  labs(title = "ATM1 - Seasonal Naive Model",
       y = "USD in Hundreds")

atm1_tf %>%
  model(
    `Holt-Winters` = ETS(Cash ~ error("A") +
                       trend("A") + season("M"))) %>%
  forecast(h = 15) %>%
  autoplot(atm1_tf) +
  labs(title = "ATM1 - Holt-Winters Model",
       y = "USD in Hundreds")

ATM2
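As a starting point, a sketch mirroring the ATM1 seasonal naive fit above; the model choice and 15-day horizon are assumptions carried over from ATM1, not yet evaluated:

```r
#seasonal naive fit for ATM2, mirroring the ATM1 approach
atm2_tf %>%
  model(`SNAIVE` = SNAIVE(Cash)) %>%
  forecast(h = 15) %>%
  autoplot(atm2_tf) +
  labs(title = "ATM2 - Seasonal Naive Model",
       y = "Box-Cox transformed Cash")
```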

ATM4
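Likewise for ATM4, a sketch of the same seasonal naive baseline (again an assumed starting point, not a final choice):

```r
#seasonal naive fit for ATM4, mirroring the ATM1 approach
atm4_tf %>%
  model(`SNAIVE` = SNAIVE(Cash)) %>%
  forecast(h = 15) %>%
  autoplot(atm4_tf) +
  labs(title = "ATM4 - Seasonal Naive Model",
       y = "Box-Cox transformed Cash")
```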

#4 Train the model

ATM1
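One possible training setup for ATM1, holding out the final four weeks as a test window; the split point and the candidate specifications below are assumptions, not part of the original analysis:

```r
#hold out the last 28 days of the transformed series for evaluation
atm1_train <- atm1_tf %>%
  filter(Date <= max(Date) - 28)

#fit candidate models on the training window only
atm1_fit <- atm1_train %>%
  model(
    `SNAIVE` = SNAIVE(Cash),
    `ETS`    = ETS(Cash ~ error("A") + trend("N") + season("A"))
  )
```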

ATM2

ATM4

#5 Evaluate model performance

ATM1
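A sketch of out-of-sample evaluation for ATM1, fitting on all but the last 28 days and scoring the held-out window with fabletools’ accuracy(); the split and the metrics shown are assumptions:

```r
#fit on the training window, forecast the held-out 28 days,
#and score against the full series
atm1_tf %>%
  filter(Date <= max(Date) - 28) %>%
  model(`SNAIVE` = SNAIVE(Cash)) %>%
  forecast(h = 28) %>%
  accuracy(atm1_tf) %>%
  select(.model, RMSE, MAE, MAPE)
```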

ATM2

ATM4

#6 Produce forecasts

ATM1
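A sketch of producing the May 2010 forecast for ATM1 on the original dollar scale. Placing box_cox() inside the model formula lets fable back-transform the point forecasts and intervals automatically; the SNAIVE choice here is an assumption pending model evaluation:

```r
#fit on the untransformed series, transforming inside the model formula
atm1_fc <- atm1_ts %>%
  model(SNAIVE(box_cox(Cash, lambda1))) %>%
  forecast(h = 31)   #daily forecasts covering May 2010

autoplot(atm1_fc, atm1_ts) +
  labs(title = "ATM1 - May 2010 Forecast", y = "USD in Hundreds")
```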

ATM2

ATM4

Part B - Power Usage Forecast

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

#1 Data preparation

Looking at a summary of the data, we see there are 3 variables: CaseSequence, showing the order of the observations; YYYY-MMM, the month; and KWH, the power usage. KWH has one missing value we will want to impute before we proceed.

power_raw <- read_csv("power_raw.csv")

skim(power_raw)
Data summary
Name power_raw
Number of rows 192
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
YYYY-MMM 0 1 8 8 0 192 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CaseSequence 0 1.00 828.5 55.57 733 780.75 828.5 876.25 924 ▇▇▇▇▇
KWH 1 0.99 6502474.6 1447570.89 770523 5429912.00 6283324.0 7620523.50 10655730 ▁▁▇▅▁
#start cleaned dataset
power <- power_raw

#transform date
power <- power %>%
  mutate(Month = yearmonth(`YYYY-MMM`)) %>%
  select(-`YYYY-MMM`)

Imputation

Below I use the imputeTS package to generate a value for the identified NA, using the linear (default) method. The imputed point seems reasonable and appears to respect the seasonality we see in the data, so I incorporate it.

#store imputation
imputation <- na_interpolation(power$KWH)
#plot imputation
ggplot_na_imputations(power$KWH, imputation) +
  my_plot_theme

#incorporate imputation
power$KWH <- imputation

#2 Visualize the data

I start by visualizing the data with a boxplot and a histogram, and see there is one low outlier we may want to check on.

power %>%
  subset(select = KWH) %>%
  ggplot(aes(KWH)) +
  geom_boxplot() +
  labs(title = "Checking KWH Distribution") +
  my_plot_theme

power %>%
  subset(select = KWH) %>%
  ggplot(aes(KWH)) +
  geom_histogram() +
  labs(title = "Checking KWH Distribution") +
  my_plot_theme

Check for Outliers

The outlier shown above on the boxplot appears to be from July 2010. In a real-world setting I would be curious to ask coworkers whether there was a large outage for part of this month that decreased consumption, or possibly a measurement error. As I don’t have the ability to do that, and it’s the only significant outlier, I’m going to replace it with the median.

power %>%
  filter(KWH < 1000000)
## # A tibble: 1 x 3
##   CaseSequence    KWH    Month
##          <dbl>  <dbl>    <mth>
## 1          883 770523 2010 Jul
#set outlier to NA for now
power$KWH[power$KWH == 770523] <- NA

#calculate median w/o extreme outlier
power_median <- median(power$KWH, na.rm = TRUE)

#put median in for outlier/NA value
power$KWH[is.na(power$KWH)] <- power_median

Checking the boxplot again, the distribution appears much more normal, though still with a fair amount of variation.

ggplot(power, aes(x = KWH)) +
  geom_boxplot() +
  my_plot_theme

Consider Transformations

After converting our data to a tsibble, a quick check of the line graph shows we’ll want a Box-Cox transformation to address the changing variance over time.

#convert to tsibble
power_ts <- tsibble(
  Month = yearmonth(power$Month),
  KWH = power$KWH)


autoplot(power_ts, KWH)

The Box-Cox procedure results in a lambda of -0.23, and we can see the transformation addresses the changing variance seen above.

#calculate lambda
lambda <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(power_ts %>%
    autoplot(box_cox(KWH, lambda)) +
     labs(y = "",
       title = "Transformed KWH with lambda = -0.23"))
#apply transformation
power_tf <-
    power_ts %>% 
    mutate(KWH = box_cox(KWH, lambda))

The transformed series still shows clear seasonality, with lower usage around month 4 (April) and higher usage around month 7 (July). This would make sense if it’s data from a region in the USA.

#check the transformed series
ggtsdisplay(power_tf$KWH, main="KWH after Box-Cox")

#3 Choose your model
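Candidate models for the monthly series could be compared as below; the two specifications are assumptions mirroring the ATM approach, not final choices:

```r
#seasonal naive vs. additive ETS on the transformed monthly series
power_tf %>%
  model(
    `SNAIVE` = SNAIVE(KWH),
    `ETS`    = ETS(KWH ~ error("A") + trend("N") + season("A"))
  ) %>%
  forecast(h = 12) %>%
  autoplot(power_tf) +
  labs(title = "Candidate Models - Transformed KWH")
```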

#4 Train the model

#5 Evaluate model performance

#6 Produce forecasts
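A sketch of producing the 2014 monthly forecast on the original KWH scale, with the Box-Cox transformation inside the model formula so fable back-transforms the forecasts automatically; the ETS choice is an assumption pending evaluation:

```r
#fit on the untransformed series, transforming inside the model formula
power_fc <- power_ts %>%
  model(ETS(box_cox(KWH, lambda))) %>%
  forecast(h = 12)   #monthly forecasts for 2014

autoplot(power_fc, power_ts) +
  labs(title = "2014 Monthly KWH Forecast", y = "KWH")
```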